Working with Big Data in Political Science

An introduction to using Google’s BigQuery with R

What problem are we solving?

  • Truly big data

    • Need to work outside of your computer’s memory
  • Need a tool that helps you store, manage, and work with these data

  • Don’t want to learn a whole new language

BigQuery

What is Google’s BigQuery?

  • Serverless data warehouse

  • Built-in query engine

bigrquery

R package that allows you to work with data stored in BigQuery through R.


To install:

install.packages(c("bigrquery", "DBI", "dplyr"))

To load into your current session:

library(bigrquery)
library(DBI)
library(dplyr)

How to store your data

BigQuery is hierarchical:

  • Tables are stored in:

    • Datasets, which are stored in:

      • Projects.

How to store your data

Tables

Tables

Tables

Uploading your data

We will step through uploading your data to BigQuery

Uploading data manually

Uploading data manually

Uploading data manually

Uploading data manually

Uploading data manually

Uploading data using bigrquery

library(bigrquery)
library(DBI)
library(dplyr)

List useful information:

selected_project <- "trade-dependence"
selected_dataset <- "bilatal_trade_hs6"

Uploading data using bigrquery

First, check that the table does not already exist:

bq_gdp_current <- bq_table(project = selected_project, 
                           dataset = selected_dataset, 
                           table = "gdp_current")

bq_table_exists(bq_gdp_current)
[1] FALSE

Uploading data using bigrquery

Next, create the (empty) table:

bq_table_create(
  bq_gdp_current,
  fields = gdp_current,
  friendly_name = "GDP (current USD)",
  description = "The data was extracted from the World Bank."
)

bq_table_exists(bq_gdp_current)
[1] TRUE

Uploading data using bigrquery

Next, upload your data to that empty table:

bq_table_upload(bq_gdp_current, gdp_df)

Working with your data

We will now step through how to work with big data out of your computer’s memory

Starting in R with dplyr

Let’s collect data on Australia’s GDP.

Starting in R with dplyr

These data are stored in the trade-dependence project and the country_annual_information dataset.

selected_project <- "trade-dependence"
selected_dataset <- "country_annual_information" 

con <- dbConnect(
  bigrquery::bigquery(),
  project = selected_project,
  dataset = selected_dataset,
  billing = selected_project
)

con
<BigQueryConnection>
  Dataset: trade-dependence.country_annual_information
  Billing: trade-dependence

Starting in R with dplyr

Create the connection to the reporter_gdp table:

gdp_df <- tbl(con, "reporter_gdp")
gdp_df
# Source:   table<reporter_gdp> [?? x 3]
# Database: BigQueryConnection
   year       reporter_code reporter_gdp_current
   <date>             <int>                <dbl>
 1 2003-01-01            92                   NA
 2 2003-01-01           136                   NA
 3 2003-01-01           531                   NA
 4 2003-01-01           292                   NA
 5 2003-01-01           408                   NA
 6 2003-01-01            NA                   NA
 7 2003-01-01           520                   NA
 8 2003-01-01           534                   NA
 9 2003-01-01           706                   NA
10 2003-01-01           728                   NA
# ℹ more rows

Starting in R with dplyr

Query that table:

aus_gdp <- gdp_df |> 
  filter(reporter_code == 36) |> 
  collect()

aus_gdp
# A tibble: 23 × 3
   year       reporter_code reporter_gdp_current
   <date>             <dbl>                <dbl>
 1 2002-01-01            36              3.96e11
 2 2018-01-01            36              1.43e12
 3 2009-01-01            36              9.29e11
 4 2011-01-01            36              1.40e12
 5 2004-01-01            36              6.14e11
 6 2021-01-01            36              1.55e12
 7 2017-01-01            36              1.33e12
 8 2008-01-01            36              1.06e12
 9 2001-01-01            36              3.79e11
10 2012-01-01            36              1.55e12
# ℹ 13 more rows

Moving between R and BigQuery

R will write your SQL queries for you:

gdp_df |> 
  filter(reporter_code == 36) |> 
  show_query()
<SQL>
SELECT `reporter_gdp`.*
FROM `reporter_gdp`
WHERE (`reporter_code` = 36.0)

Moving between R and BigQuery

You can perform that query in BigQuery’s in-built query engine:

Moving between R and BigQuery

You can perform that query in BigQuery’s in-built query engine:

Performing complex queries

Next steps

  • Partitioning and clustering your datasets
    • Great for yearly, country-level data
  • Integrated ML model-building